AI-Powered Knowledge Graph Construction for Enterprise Infrastructure
Transform unstructured infrastructure data into intelligent knowledge graphs using LangChain and Neo4j. Production-ready foundation for Graph RAG, infrastructure analysis, and intelligent operations.
- Overview
- Foundational Concepts
- Knowledge Graph Concepts
- Quick Start
- Prerequisites
- Installation
- Configuration
- Usage
- API Reference
- Testing
- Contributing
- Troubleshooting
- Performance
- Technical Deep Dive
Enterprise organizations struggle to extract actionable intelligence from massive amounts of unstructured infrastructure data (logs, configs, documents, incident reports). Traditional parsing creates disconnected data points; this solution creates intelligent, queryable knowledge graphs.
# INPUT: Raw infrastructure files
/var/log/secure, /etc/httpd/conf/httpd.conf, /var/log/yum.log...
# PROCESSING: AI semantic understanding
LangChain LLMGraphTransformer + LLM API → Intelligent entity extraction
# OUTPUT: Connected knowledge graph
(web-prod-01:Server)-[:RUNS]->(httpd:Service)
(httpd:Service)-[:DEPENDS_ON]->(openssl:Package)
(CVE-2023-12345:Vulnerability)-[:AFFECTS]->(openssl:Package)
# CAPABILITY: Graph RAG queries
"Which production servers are affected by the latest OpenSSL vulnerability?"
- AI-Powered: Semantic understanding via LLMs, not brittle regex patterns
- Intelligent Relationships: Auto-discovers complex dependencies and connections
- Graph RAG Ready: Foundation for intelligent retrieval systems
- Domain Agnostic: Works with IT infrastructure, security, business processes, documents
- Production Scale: Handles enterprise workloads (1000+ systems, 15K+ entities)
- Enterprise Ready: Security, configuration management, monitoring integration
If you're new to graph databases, Neo4j, or knowledge graphs, this section explains the fundamental concepts from the ground up.
Traditional Databases (Tables):
servers.csv services.csv packages.csv
server_id | status service_id | port package_id | version
web-01 | active httpd | 80 openssl | 3.0.1
db-02 | active mysql | 3306 httpd | 2.4.53
Problem: Data is disconnected. Hard to answer "Which servers use vulnerable OpenSSL?"
Graph Database (Connected):
[Analytics-Dev-648] ──HOSTS──> [Httpd] ──USES──> [Httpd-Tools]
        │                                              ↑
        └──────HOSTS──> [Mysql] ──USES─────────────────┘
Solution: Everything is connected. Easy to trace relationships and dependencies.
Neo4j is a graph database that stores data as nodes (things) and relationships (how things connect):
- Node: A "thing" in your data (server, service, package, person, etc.)
- Relationship: How two nodes connect (HOSTS, RUNS, USES, DEPENDS_ON, etc.)
- Property: Details about nodes (in your system: id for identification)
Think of it like a social network for your infrastructure:
[You] ──FRIENDS_WITH──> [Person] ──WORKS_AT──> [Company]
[Analytics-Dev-648] ──HOSTS──> [Httpd] ──USES──> [Httpd-Tools]
A knowledge graph is a smart way to organize information so computers can understand relationships and answer complex questions.
Example with your RHEL systems:
Raw Data (what you have):
/var/log/messages: "httpd started on Analytics-Dev-648"
/var/lib/rpm/packages.txt: "httpd-tools-2.4.53 installed"
/etc/redhat-release: "Red Hat Enterprise Linux release 9.3"
Knowledge Graph (what we create):
(Analytics-Dev-648:Server) ──HOSTS──> (Httpd:Service) ──USES──> (Httpd-Tools:Package)
(Red Hat Enterprise Linux:System) ──HOSTS──> (Httpd:Service)
Smart Questions You Can Now Ask:
- "Which servers would be affected by an Httpd-Tools vulnerability?"
- "What services will stop if I reboot Analytics-Dev-648?"
- "Show me all production web servers and their dependencies"
GraphRAG = Graph + Retrieval Augmented Generation
Traditional RAG:
Question → Search Documents → Send to LLM → Answer
GraphRAG:
Question → Query Knowledge Graph → Get Connected Data → Send to LLM → Smarter Answer
Example:
Question: "Which production servers are vulnerable to CVE-2023-12345?"
Traditional Search: Finds documents mentioning the CVE
GraphRAG: Finds CVE → traces to affected packages → traces to services → traces to production servers
Result: Complete impact analysis, not just document search.
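The GraphRAG flow above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the triples mirror the CVE example from the overview, and `retrieve_facts` and `build_prompt` are hypothetical helper names. The final LLM call is left out, so the sketch stops at the prompt that would be sent.

```python
# CVE -> package -> service -> server chain from the overview example, as triples.
TRIPLES = [
    ("CVE-2023-12345", "AFFECTS", "openssl"),
    ("httpd", "DEPENDS_ON", "openssl"),
    ("web-prod-01", "RUNS", "httpd"),
]

def retrieve_facts(cve: str, triples=TRIPLES) -> list[str]:
    """Follow CVE -> package -> service -> server and return the facts on that path."""
    packages = {t for s, r, t in triples if s == cve and r == "AFFECTS"}
    services = {s for s, r, t in triples if r == "DEPENDS_ON" and t in packages}
    facts = []
    for s, r, t in triples:
        on_path = (
            (s == cve and r == "AFFECTS")
            or (r == "DEPENDS_ON" and t in packages)
            or (r == "RUNS" and t in services)
        )
        if on_path:
            facts.append(f"{s} {r} {t}")
    return facts

def build_prompt(question: str, facts: list[str]) -> str:
    """Combine the retrieved graph facts with the question for the LLM."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Known facts:\n{context}\n\nQuestion: {question}"

question = "Which production servers are vulnerable to CVE-2023-12345?"
print(build_prompt(question, retrieve_facts("CVE-2023-12345")))
```

The point of the design is that the LLM receives the connected path, not a pile of documents that merely mention the CVE.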
Here's how your actual RHEL system files transform into a queryable knowledge graph:
/var/log/messages:
"Jan 15 14:23:01 Analytics-Dev-648 systemd[1]: Started The Apache HTTP Server"
"Jan 15 14:23:15 Analytics-Dev-648 httpd[5678]: Server configured"
/etc/redhat-release:
"Red Hat Enterprise Linux release 9.3 (Plow)"
/var/lib/rpm/packages.txt:
"httpd-tools-2.4.53-11.el9_2.5.x86_64"
"dnf-4.7.0-4.el9.noarch"
The AI reads these files and understands:
- "Analytics-Dev-648" is a server
- "httpd" is a service (Apache web server)
- "systemd" manages services
- "httpd-tools" and "dnf" are packages
- These things have relationships
// Nodes (things discovered)
(:Server {id: "Analytics-Dev-648"})
(:Service {id: "Httpd"})
(:Package {id: "Httpd-Tools"})
(:System {id: "Red Hat Enterprise Linux"})
// Relationships (connections discovered)
(Analytics-Dev-648)-[:HOSTS]->(Httpd)
(Httpd)-[:USES]->(Httpd-Tools)
(Red Hat Enterprise Linux)-[:HOSTS]->(Httpd)
All this gets stored in Neo4j where you can query it:
// Find all services on production web servers
MATCH (server:Server)-[:HOSTS|RUNS]->(service:Service)
WHERE server.id CONTAINS "Web-Prod"
RETURN server.id, service.id
Before (Traditional):
- Manual documentation that gets outdated
- Siloed information in different systems
- Hard to understand dependencies
- Slow incident response
After (Knowledge Graph + GraphRAG):
- Automatic discovery of relationships
- Live, connected infrastructure map
- Instant impact analysis
- AI-powered intelligent operations
Real Benefits:
- Security: "Which servers use Httpd-Tools package?" → Instant answer
- Operations: "What breaks if I restart Analytics-Dev-648?" → Complete dependency map
- Compliance: "Show me all production systems with their services" → Automated audit
- Planning: "What's the blast radius of updating Httpd?" → Risk assessment
Think of your infrastructure as a city:
- Buildings = Servers (nodes)
- Residents = Services (nodes)
- Utilities = Packages (nodes)
- Roads = Relationships (how they connect)
Traditional documentation = Static map
Knowledge Graph = Live GPS with traffic data
GraphRAG = Smart city assistant that understands the whole system and can answer complex questions about traffic, dependencies, and impacts.
This section explains how abstract knowledge graph terminology maps to your concrete RHEL infrastructure, making it easier to understand what the system creates and how to explain it to stakeholders.
| Knowledge Graph Term | RHEL Infrastructure Meaning | Your Data Examples | Business Value |
|---|---|---|---|
| Entity/Node | Physical or logical IT component | Server, Service, Package | "Things that exist in our infrastructure" |
| Entity Type | Category of IT component | Server (133), Service (41), Package (28) | "Types of infrastructure components" |
| Properties | Characteristics of components | id (unique identifier) | "Details about each component" |
| Domain | Area of business/IT | RHEL Infrastructure & Systems | "The part of IT we're modeling" |
Your Specific Entity Types:
- Server (133): Physical/virtual machines (Analytics-Dev-648, Web-Prod-898, Red Hat Enterprise Linux)
- Service (41): Running processes (Httpd, Mysql, Sshd)
- Package (28): Installed software (Yum, Dnf, Kernel)
- Component (28): System parts (Kernel, Selinux, Storage)
- Application (18): Business applications (Httpd, Analytics, Database)
| Knowledge Graph Term | RHEL Infrastructure Meaning | Your Data Examples | Business Impact |
|---|---|---|---|
| Relationship/Edge | How components interact | Server HOSTS Service | "Dependencies & connections" |
| Relationship Type | Kind of interaction | HOSTS, RUNS, USES, DEPENDS_ON | "Types of dependencies" |
| Graph Traversal | Following connections | Find all services on a server | "Impact analysis queries" |
| Path | Chain of relationships | Server → Service → Package → Vulnerability | "Root cause analysis" |
Your Specific Relationship Types:
- HOSTS (258): Server physically contains service (Analytics-Dev-648 HOSTS Httpd)
- RUNS (83): Server executes service (Web-Prod-898 RUNS Mysql)
- USES (31): Server uses package (Server USES Package)
- DEPENDS_ON (33): Service depends on component (Service DEPENDS_ON Component)
- MANAGES: System controls component (AI-discovered management relationships)
| Abstract Concept | RHEL Infrastructure Translation | Real Business Scenario |
|---|---|---|
| Semantic Search | "Find all web servers with SSL vulnerabilities" | Security team identifies at-risk systems |
| Relationship Discovery | "What services would break if this server fails?" | Incident response planning |
| Graph Traversal | "Trace dependency chain from user request to database" | Performance troubleshooting |
| Entity Linking | "Connect security alerts to affected applications" | Automated incident correlation |
| Knowledge Inference | "If package X is vulnerable, which servers are affected?" | Proactive security patching |
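The "Graph Traversal" and "Knowledge Inference" rows both come down to walking relationships against their direction: start at a component and collect everything that points at it, directly or indirectly. A minimal in-memory sketch follows; the edge list reuses example names from this README, and `affected_by` is a hypothetical helper, not a function the project provides.

```python
from collections import deque

# (source, relationship, target) edges; example names from this README.
EDGES = [
    ("Web-Prod-898", "RUNS", "Mysql"),
    ("Analytics-Dev-648", "HOSTS", "Httpd"),
    ("Httpd", "USES", "Httpd-Tools"),
]

def affected_by(start: str, edges=EDGES) -> list[str]:
    """Breadth-first walk against edge direction: everything that depends on `start`."""
    dependents: dict[str, list[str]] = {}
    for src, _rel, dst in edges:
        dependents.setdefault(dst, []).append(src)

    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

# A vulnerable Httpd-Tools package reaches the service first, then its server.
print(affected_by("Httpd-Tools"))
```

In Neo4j the same question would normally be a variable-length path query; the Python version only shows what that traversal does.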
INPUT (Raw RHEL Files):
/var/log/messages: "systemd[1]: Started The Apache HTTP Server"
/etc/redhat-release: "Red Hat Enterprise Linux release 9.3"
/var/log/yum.log: "Installed: httpd-2.4.53-11.el9_2.5.x86_64"
↓ AI ANALYSIS (LangChain LLMGraphTransformer) ↓
OUTPUT (Knowledge Graph):
(:Server {id: "Analytics-Dev-648"})-[:HOSTS]->(:Service {id: "Httpd"})
(:Server {id: "Web-Prod-898"})-[:RUNS]->(:Service {id: "Mysql"})
(:System {id: "Red Hat Enterprise Linux"})-[:HOSTS]->(:Service {id: "Httpd"})
For IT Management:
"We've created an intelligent map of our infrastructure that automatically discovers how our 1000+ RHEL systems connect. Instead of manual documentation, AI analyzes system logs and creates a live knowledge graph showing which servers run which services, what software they depend on, and how they connect."
For Security Teams:
"The knowledge graph enables instant impact analysis. When a CVE is announced, we can immediately query: 'Which production servers use vulnerable package X?' and get answers in seconds, not hours of manual investigation."
For Operations Teams:
"We can now ask intelligent questions like 'What would break if I restart server Y?' or 'Show me all services that depend on database Z' and get complete dependency maps for planning maintenance windows."
// Business Question: "What services are hosted by web production servers?"
MATCH (server:Server)-[:HOSTS|RUNS]->(service:Service)
WHERE server.id CONTAINS "Web-Prod"
RETURN server.id, service.id
// Example Result:
// server.id: "Web-Prod-898", service.id: "Httpd"
// server.id: "Web-Prod-898", service.id: "Mysql"
// Knowledge Graph Answer: Shows all services on production web servers
// Business Value: Instant infrastructure inventory for compliance audits
This transforms your RHEL infrastructure from "a bunch of servers" into "an intelligent, queryable knowledge system" that enables proactive IT operations!
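The business query above can also be run from Python with the official neo4j driver. The sketch below assumes driver version 5.8 or later (for `execute_query`) and a reachable database; `services_on` is a hypothetical helper name, and the "Web-Prod" fragment is passed as a query parameter instead of being written into the Cypher text.

```python
QUERY = """
MATCH (server:Server)-[:HOSTS|RUNS]->(service:Service)
WHERE server.id CONTAINS $name_fragment
RETURN server.id AS server, service.id AS service
"""

def services_on(name_fragment: str, uri: str, user: str, password: str) -> list[tuple[str, str]]:
    """Run the query against Neo4j and return (server, service) pairs."""
    # Imported lazily so the module loads even where the neo4j package is absent.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        # Keyword arguments to execute_query are passed as query parameters.
        records, _summary, _keys = driver.execute_query(QUERY, name_fragment=name_fragment)
        return [(record["server"], record["service"]) for record in records]
```

A call such as `services_on("Web-Prod", "neo4j://127.0.0.1:7687", "neo4j", "password")` would use the connection settings from the Quick Start.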
Get a working knowledge graph in 15 minutes:
# 1. Clone and setup environment
git clone https://github.com/rrbanda/dataloader.git
cd dataloader
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# 2. Setup Neo4j Desktop (required)
# Download: https://neo4j.com/download/
# Create database "dataloader-db" with password "password"
# Install APOC plugin
# 3. Configure API access
export OPENAI_API_KEY="your-llm-api-key"
export OPENAI_BASE_URL="https://llama-4-scout-17b-16e-w4a16-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443/v1"
export MODEL="llama-4-scout-17b-16e-w4a16"
export NEO4J_URI="neo4j://127.0.0.1:7687"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="password"
# 4. Generate test data and verify setup
python utils/rhel_filesystem_generator.py
python test_setup.py # Should show 5/5 tests pass
# 5. Create your first knowledge graph
python -c "
from core.unified_dataloader import get_universal_loader
loader = get_universal_loader()
systems, events = loader.load_all_systems()
print(f'Knowledge graph created: {len(systems)} systems processed')
loader.close()
"
# 6. Explore in Neo4j Desktop Browser
# Query: MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 25
Expected Result: Interactive knowledge graph with ~39 nodes and 16 relationships representing intelligent infrastructure analysis.
- LLM API key (Red Hat AI or OpenAI)
  - Create an account with your LLM provider → API Keys section → Create new key
- Memory: 4GB minimum, 8GB recommended for large datasets
- Storage: 2GB free space
- Network: Internet access for AI API calls
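Before installing, it can help to confirm that the environment variables used in the Quick Start are actually set. The check below is a small sketch: the variable names come from the Quick Start, while `missing_vars` is a hypothetical helper and not part of the project's `test_setup.py`.

```python
import os

# Variable names taken from the Quick Start section.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "OPENAI_BASE_URL",
    "MODEL",
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
]

def missing_vars(env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_vars()
if missing:
    print("Missing configuration:", ", ".join(missing))
else:
    print("All required variables are set")
```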
# Clone repository
git clone https://github.com/rrbanda/dataloader.git
cd dataloader
# Create isolated environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# Verify installation
python -c "import langchain, neo4j; print('Dependencies installed')"
- Install Neo4j Desktop → https://neo4j.com/download/
- Create Project → "dataloader-project"
- Add Database:
  - Name: dataloader-db
  - Password: password
  - Version: Latest 5.x
- Install APOC Plugin → Select database → Plugins → APOC → Install
- Start Database → Click the Start button
Verify Neo4j:
lsof -i :7687 # Should show Neo4j process
# LLM Configuration
export OPENAI_API_KEY="your-llm-api-key"
export OPENAI_BASE_URL="https://llama-4-scout-17b-16e-w4a16-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443/v1"
export MODEL="llama-4-scout-17b-16e-w4a16"
# Neo4j Configuration
export NEO4J_URI="neo4j://127.0.0.1:7687"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="password"
export NEO4J_DATABASE="neo4j"
# Data source configuration
data_sources:
  primary_data:
    type: "filesystem"
    base_path: "simulated_rhel_systems"
    file_patterns:
      system_info: ["**/system_info.txt"]
      logs: ["**/*.log"]
      configs: ["**/*.conf", "**/*.yaml"]
# LLM configuration (environment-driven)
llm_config:
  enabled: true
# Neo4j configuration (environment-driven)
neo4j_config:
  uri: "${NEO4J_URI}"
  username: "${NEO4J_USERNAME}"
  password: "${NEO4J_PASSWORD}"
  database: "${NEO4J_DATABASE}"
Understanding the complete journey from raw data to intelligent knowledge graphs:
STEP 1: DATA GENERATION
├── utils/rhel_filesystem_generator.py generates realistic RHEL systems
├── Creates 18 authentic files per system (/var/log/secure, /etc/redhat-release, etc.)
├── NO LLM USED - Pure file system simulation
└── Output: simulated_rhel_systems/ directory with realistic data
STEP 2: DATA LOADING
├── FilesystemDataSourceAdapter reads generated files
├── TextProcessor cleans and chunks content
├── NO LLM USED - Traditional text processing (regex, patterns)
└── Output: Cleaned, structured text ready for AI analysis
STEP 3: KNOWLEDGE GRAPH CREATION (LLM CORE ROLE)
├── LangChain LLMGraphTransformer analyzes cleaned text
├── LLM identifies entities (Server, Service, Package, User, Vulnerability)
├── LLM infers relationships (RUNS, DEPENDS_ON, USES, AFFECTS)
├── LLM creates semantic understanding (not just pattern matching)
└── Output: Intelligent graph nodes and relationships
STEP 4: NEO4J STORAGE
├── Neo4jGraph stores LLM-extracted entities and relationships
├── Creates queryable knowledge graph in Neo4j Desktop
├── NO LLM USED - Direct database operations
└── Output: Interactive graph ready for Graph RAG queries
# Step 1: Data Generation
rhel_generator = RHELFilesystemGenerator(num_systems=100)
rhel_generator.generate_all_systems()
# No AI - Creates realistic file content using templates
# Step 2: Data Loading
files = data_source.read_system_files("web-prod-01")
cleaned = text_processor.process_files(files)
# No AI - Traditional text cleaning (remove ANSI, normalize whitespace)
# Step 4: Neo4j Storage
neo4j_graph.add_nodes(extracted_entities)
neo4j_graph.add_relationships(extracted_relationships)
# No AI - Direct database writes
# Step 3: Knowledge Graph Creation (LLM CORE FUNCTION)
llm_transformer = LLMGraphTransformer(
llm=ChatOpenAI(model="llama-4-scout-17b-16e-w4a16"),
node_properties=["name", "type", "status"],
relationship_properties=["type", "strength"]
)
# LLM analyzes this text:
input_text = """
Jan 15 14:23:01 web-prod-01 yum[1234]: Installed: httpd-2.4.53-11.el9_2.5.x86_64
Jan 15 14:23:15 web-prod-01 systemd[1]: Started The Apache HTTP Server
Jan 15 14:23:20 web-prod-01 httpd[5678]: AH00558: Could not reliably determine server's FQDN
"""
# LLM INTELLIGENCE CREATES:
entities = [
Node(id="web-prod-01", labels=["Server"], properties={"environment": "production"}),
Node(id="httpd", labels=["Service"], properties={"status": "active", "port": "80,443"}),
Node(id="httpd-2.4.53", labels=["Package"], properties={"version": "2.4.53-11.el9_2.5"})
]
relationships = [
Relationship(source="web-prod-01", target="httpd", type="RUNS"),
Relationship(source="httpd", target="httpd-2.4.53", type="USES"),
Relationship(source="package-install", target="service-start", type="PRECEDED_BY")
]
# LLM SEMANTIC UNDERSTANDING:
# - Connects package installation → service start → configuration warning
# - Infers web-prod-01 is a production server running Apache
# - Understands httpd = "Apache HTTP Server" = web service
# - Creates temporal relationships between events
# What regex/patterns can do:
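import re
log_line = "Jan 15 14:23:15 web-prod-01 systemd[1]: Started The Apache HTTP Server"
# Hostnames such as web-prod-01 contain hyphens, which \w alone does not match
log_pattern = r"(\w+\s+\d+\s+[\d:]+)\s+([\w.-]+)\s+([\w.-]+)\[(\d+)\]:\s+(.+)"
match = re.match(log_pattern, log_line)
# Extracts: timestamp, hostname, service, PID, message
# Misses: Relationships, context, semantic meaning, entity types
# What LLM understanding provides: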
llm_analysis = llm_transformer.convert_to_graph_documents([Document(page_content=log_line)])
# Extracts: Entities with proper types and properties
# Infers: Relationships between entities (RUNS, DEPENDS_ON, CAUSES)
# Understands: Context ("httpd" = web service, needs SSL, serves HTTP traffic)
# Creates: Temporal sequences (install → start → error)
# Connects: Cross-system dependencies and impacts
# Generate 100 realistic RHEL systems
python utils/rhel_filesystem_generator.py 100
# Output: 1,800 files created
# /simulated_rhel_systems/
# ├── web-prod-01/var/log/secure (SSH logs)
# ├── web-prod-01/var/log/yum.log (package installs)
# ├── db-prod-01/var/log/mysql/error.log (database logs)
# └── ... (1,800 total files)
# Read all 1,800 files, clean and structure
from core.unified_dataloader import get_universal_loader
loader = get_universal_loader()
# Traditional processing (no AI):
# - Read 1,800 files from filesystem
# - Remove ANSI codes, normalize whitespace
# - Apply Grok patterns for log parsing
# - Chunk large files for AI processing
# Result: ~2.5MB of clean, structured text
# LLM analyzes all text and creates intelligent graph
systems, events = loader.load_all_systems()
# LLM PROCESSING (the intelligence):
# - Analyzes 2.5MB of text across 100 systems
# - Identifies ~1,500 unique entities (servers, services, packages)
# - Infers ~1,200 relationships between entities
# - Creates semantic understanding of infrastructure
# - Builds temporal event sequences
# Result: Intelligent knowledge graph in Neo4j
// Now you can ask intelligent questions:
MATCH (s:Server)-[:RUNS]->(svc:Service)-[:USES]->(p:Package)
WHERE s.environment = 'production' AND p.name CONTAINS 'ssl'
RETURN s.hostname, svc.name, p.version
// Find servers affected by security vulnerabilities:
MATCH (s:Server)-[:RUNS]->(svc:Service)-[:USES]->(p:Package)<-[:AFFECTS]-(v:Vulnerability)
WHERE v.severity = 'Critical'
RETURN s.hostname, v.cve_id, p.name
# WITHOUT LLM (Traditional):
Data → Pattern Matching → Disconnected Records
# Limited to what you explicitly program
# WITH LLM (Intelligent):
Data → Semantic Understanding → Connected Knowledge Graph
# Discovers relationships and context you didn't program
The LLM is the bridge that transforms raw infrastructure data into intelligent, queryable knowledge!
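For readers who want to try the LLM step in isolation, here is a hedged sketch of calling LangChain's `LLMGraphTransformer` directly. It assumes the `langchain-experimental`, `langchain-openai`, and `langchain-core` packages are installed and that the API key and base URL from the Quick Start are exported. Restricting extraction with `allowed_nodes` and `allowed_relationships` is an assumption made for this example; the project may configure the transformer differently. `extract_graph` and `to_triples` are hypothetical helper names.

```python
# Assumed schema for this sketch, using labels and relationship types named in this README.
ALLOWED_NODES = ["Server", "Service", "Package"]
ALLOWED_RELATIONSHIPS = ["HOSTS", "RUNS", "USES", "DEPENDS_ON"]

def extract_graph(text: str, model: str):
    """Send text to the LLM and return LangChain GraphDocument objects."""
    # Imported lazily: these packages and valid API credentials are needed only here.
    from langchain_core.documents import Document
    from langchain_experimental.graph_transformers import LLMGraphTransformer
    from langchain_openai import ChatOpenAI

    transformer = LLMGraphTransformer(
        llm=ChatOpenAI(model=model, temperature=0),
        allowed_nodes=ALLOWED_NODES,
        allowed_relationships=ALLOWED_RELATIONSHIPS,
    )
    return transformer.convert_to_graph_documents([Document(page_content=text)])

def to_triples(graph_documents) -> list[tuple[str, str, str]]:
    """Flatten extracted relationships into (source id, type, target id) triples."""
    return [
        (rel.source.id, rel.type, rel.target.id)
        for doc in graph_documents
        for rel in doc.relationships
    ]
```

Calling `to_triples(extract_graph(log_text, model))` gives a quick way to inspect what the model extracted before anything is written to Neo4j. The output varies by model, so no expected result is shown.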
The Universal DataLoader is designed to complement, not replace your existing enterprise tools. Here's how it integrates with common infrastructure:
# Your existing infrastructure (unchanged)
Red Hat Insights ──> Security recommendations & compliance
Splunk/ELK ──> Centralized logging & search
Prometheus/Grafana ──> Metrics monitoring & alerting
rsyslog/fluentd ──> Log collection & forwarding
SIEM tools ──> Security event correlation
# Universal DataLoader adds:
Knowledge Graph ──> AI-powered infrastructure intelligence
Graph RAG ──> Intelligent operations & queries
The dataloader only reads existing files - it never modifies or interferes:
# What it reads (read-only access):
/var/log/messages # Standard syslog files
/var/log/secure # SSH authentication logs
/var/log/yum.log # Package installation logs
/etc/redhat-release # System version info
/proc/cpuinfo # Hardware information
# What it NEVER touches:
- Log collection configurations (rsyslog.conf)
- Red Hat Insights client settings
- Active databases or services
- Network configurations
- Security policies
# Complementary intelligence
Red Hat Insights: SaaS security analysis & recommendations
Universal DataLoader: Local knowledge graph & Graph RAG
# Combined benefits:
insights_data = insights_api.get_vulnerabilities()
local_graph = dataloader.create_knowledge_graph()
intelligent_ops = combine_insights_with_graph_rag(insights_data, local_graph)
# Enhanced log intelligence
Splunk/ELK: Centralized log storage & search
Universal DataLoader: AI-powered relationship extraction
# Graph RAG queries become possible:
"Which servers have both SSH failures AND vulnerable packages?"
"Show dependency chain for services affected by security patches"
# Infrastructure intelligence
Prometheus/Grafana: Metrics monitoring & dashboards
Universal DataLoader: Semantic understanding of infrastructure
# Intelligent correlation:
"Which high-CPU systems also have recent package vulnerabilities?"
"Show service dependencies for systems with memory alerts"
# Resource requirements
CPU Usage: Low - only during batch processing
Memory: 2-4GB during knowledge graph creation
I/O Impact: Read-only file access, no write operations
Network: LLM API calls only (configurable endpoints)
Storage: No additional storage on monitored systems
# Maintains enterprise security standards
Data Access: Read-only filesystem access
API Keys: Externalized configuration (environment variables)
Audit Trail: All operations logged
Network: Configurable endpoints (on-premise LLM support planned)
Encryption: Data in transit encrypted (HTTPS/TLS)
# Direct API consumption (no file system access)
red_hat_insights_api:
  vulnerabilities: "/api/insights/v1/vulnerabilities"
  compliance: "/api/insights/v1/compliance"
  recommendations: "/api/insights/v1/advisor"
splunk_api:
  search: "/services/search/jobs"
  saved_searches: "/services/saved/searches"
elasticsearch_api:
  search: "/_search"
  indices: "/_cat/indices"
# Native integrations
satellite_connector:
  systems_inventory: "satellite.example.com/api/v2/hosts"
  package_management: "satellite.example.com/api/v2/packages"
prometheus_connector:
  metrics_query: "prometheus.example.com/api/v1/query"
  alert_rules: "prometheus.example.com/api/v1/rules"
# Enterprise-grade capabilities
rbac_integration:
  active_directory: "LDAP/AD authentication"
  role_based_access: "Fine-grained permissions"
audit_compliance:
  sox_compliance: "Financial audit trails"
  hipaa_compliance: "Healthcare data protection"
  gdpr_compliance: "Data privacy controls"
- Identify data sources: Which logs/configs to analyze
- Verify permissions: Read-only access to target files
- Network access: LLM API endpoints reachable
- Resource planning: 4-8GB RAM for processing
- Security review: API key management strategy
- Start with test data: Use simulated systems first
- Validate knowledge graph: Check entity extraction quality
- Test Graph RAG queries: Verify intelligent responses
- Monitor resource usage: CPU, memory, network impact
- Security audit: Review API key security
- Integration testing: Verify compatibility with existing tools
- Performance monitoring: Track processing times and accuracy
- User training: Enable teams to use Graph RAG capabilities
- Expand data sources: Add more systems incrementally
- Plan API integrations: Connect to enterprise APIs
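The "Validate knowledge graph" item in the checklist above can be partly automated. The sketch below shows two simple checks on extracted data: node ids that differ only by letter case (this README itself uses both "httpd" and "Httpd") and relationships that point at unknown nodes. Both helper names are hypothetical, and the checks are illustrative, not a full quality review.

```python
from collections import defaultdict

def case_collisions(node_ids: list[str]) -> dict[str, list[str]]:
    """Group node ids that differ only by letter case, e.g. 'httpd' and 'Httpd'."""
    groups = defaultdict(set)
    for node_id in node_ids:
        groups[node_id.lower()].add(node_id)
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}

def dangling_edges(node_ids: list[str], edges: list[tuple[str, str, str]]):
    """Return edges whose source or target is not a known node id."""
    known = set(node_ids)
    return [(s, r, t) for s, r, t in edges if s not in known or t not in known]

nodes = ["Analytics-Dev-648", "Httpd", "httpd"]
edges = [("Analytics-Dev-648", "HOSTS", "Httpd"), ("Httpd", "USES", "Httpd-Tools")]
print(case_collisions(nodes))
print(dangling_edges(nodes, edges))
```

Collisions usually mean the same component was extracted twice under different spellings; dangling edges mean a relationship refers to an entity that was never created as a node.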
# Recommended deployment approach
Phase 1: Development → Test with simulated data
Phase 2: Staging → Small subset of real systems
Phase 3: Production → Full enterprise deployment
Phase 4: Integration → Connect to enterprise APIs
# Start with low-risk, high-value data
Tier 1: System info files (static data)
Tier 2: Historical logs (archived data)
Tier 3: Current logs (operational data)
Tier 4: Real-time streams (future capability)
# Enterprise security alignment
Data Classification: Treat as internal/confidential
API Key Management: Use enterprise secret management
Access Controls: Implement least-privilege access
Audit Requirements: Log all data access operations
from core.unified_dataloader import get_universal_loader
# Create and load systems
loader = get_universal_loader()
systems, events = loader.load_all_systems()
print(f"Created knowledge graph with {len(systems)} systems")
loader.close()
# Point to your infrastructure data
config = {
    "data_sources": {
        "your_data": {
            "type": "filesystem",
            "base_path": "/path/to/your/logs",
            "file_patterns": {
                "logs": ["**/*.log", "**/*.txt"],
                "configs": ["**/*.conf", "**/*.yaml"]
            }
        }
    }
}
loader = get_universal_loader(config=config)
systems, events = loader.load_all_systems()
# Execute intelligent Cypher queries
result = loader.neo4j_graph.query("""
MATCH (s:Server)-[:RUNS]->(svc:Service)
WHERE s.environment = 'production'
RETURN s.name, svc.name, svc.status
""")
for record in result:
    print(f"Server: {record['s.name']}, Service: {record['svc.name']}")
from utils.rhel_filesystem_generator import RHELFilesystemGenerator
# Generate enterprise-scale test data
generator = RHELFilesystemGenerator(num_systems=100)
generator.generate_all_systems()
print("Generated 100 enterprise RHEL systems")
class UnifiedDataLoader:
    def load_all_systems(self) -> Tuple[List[SystemEntity], List[EventEntity]]:
        """Load all systems and create knowledge graph"""
    def load_system(self, system_id: str) -> Tuple[SystemEntity, List[EventEntity]]:
        """Load specific system"""
    def close(self) -> None:
        """Cleanup connections"""
class RHELFilesystemGenerator:
    def __init__(self, num_systems: int = 5): ...
    def generate_all_systems(self) -> Dict[str, Dict]:
        """Generate realistic RHEL systems"""
class UnifiedConfigLoader:
    def get_llm_config(self) -> Dict[str, Any]: ...
    def get_neo4j_config(self) -> Dict[str, Any]: ...
    def get_dataloader_config(self) -> Dict[str, Any]: ...
@dataclass
class SystemEntity:
    system_id: str
    hostname: str
    environment: str
    services: List[str]
    metadata: Dict[str, Any]
@dataclass
class EventEntity:
    event_id: str
    system_id: str
    event_type: str
    timestamp: datetime
    description: str
    metadata: Dict[str, Any]
# Activate environment
source .venv/bin/activate
# Run all tests
python -m pytest tests/ -v
# Run with coverage
python -m pytest tests/ --cov=core --cov=config --cov=utils
# Quick setup verification
python test_setup.py # Should show 5/5 tests pass
# Test with generated data
python utils/rhel_filesystem_generator.py 10
python tests/test_complete_4phase_pipeline.py
# Benchmark with scale data
python utils/rhel_filesystem_generator.py 1000
time python -c "
from core.unified_dataloader import get_universal_loader
loader = get_universal_loader()
systems, events = loader.load_all_systems()
print(f'Processed {len(systems)} systems')
loader.close()
"
We welcome contributions! Please follow these guidelines:
# Fork repo, then:
git clone https://github.com/YOUR-USERNAME/dataloader.git
cd dataloader
# Setup development environment
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Create feature branch
git checkout -b feature/your-feature
- Fork the repository
- Create feature branch from main
- Add tests for new functionality
- Ensure tests pass: python -m pytest tests/ -v
- Update documentation as needed
- Create pull request with clear description
- Style: Follow PEP 8, use the black formatter
- Tests: Maintain >90% test coverage
- Documentation: Update README and docstrings
- Commits: Use conventional commit format
- Data Source Adapters: APIs, databases, cloud storage
- Domain Templates: Healthcare, finance, manufacturing
- Testing: Performance tests, edge cases
- Documentation: Tutorials, examples, guides
- Performance: Optimization, parallel processing
# Check if Neo4j is running
lsof -i :7687
# Verify credentials
echo $NEO4J_PASSWORD # Should be 'password'
# Verify API configuration
echo $OPENAI_API_KEY
echo $OPENAI_BASE_URL
# Verify the LangChain OpenAI client imports (this does not call the API)
python -c "from langchain_openai import ChatOpenAI; print('LLM client import OK')"
# Generate sample data
python utils/rhel_filesystem_generator.py
# Check file permissions
ls -la simulated_rhel_systems/
- Slow processing: Reduce batch size, check network latency
- Memory errors: Increase system memory or reduce dataset size
- Neo4j issues: Ensure APOC plugin installed, check database status
- Check documentation and troubleshooting section
- Search existing issues on GitHub
- Create detailed issue with environment details and logs
- Join discussions for questions and community support
| Systems | Processing Time | Entities Created | Memory Usage |
|---|---|---|---|
| 5 | 45 seconds | ~45 entities | 1.2GB |
| 50 | 8 minutes | ~450 entities | 2.1GB |
| 100 | 15 minutes | ~900 entities | 2.8GB |
| 1000 | 3.5 hours | ~9000 entities | 4.2GB |
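The benchmark rows are easier to compare as per-system figures. The arithmetic below uses only the numbers from the table above; the variable names are illustrative.

```python
# (systems, processing seconds, entities) copied from the table above.
BENCHMARKS = [
    (5, 45, 45),
    (50, 8 * 60, 450),
    (100, 15 * 60, 900),
    (1000, int(3.5 * 3600), 9000),
]

seconds_per_system = [round(seconds / systems, 1) for systems, seconds, _ in BENCHMARKS]
entities_per_system = [entities / systems for systems, _, entities in BENCHMARKS]

for (systems, _, _), secs, ents in zip(BENCHMARKS, seconds_per_system, entities_per_system):
    print(f"{systems:>5} systems: {secs} s/system, {ents:.0f} entities/system")
```

On these figures, processing stays near 9 to 10 seconds per system up to 100 systems and rises to about 12.6 seconds per system at 1000, while the entity count stays at roughly 9 per system.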
- Data Sources: Filesystem only (APIs/databases planned)
- File Size: Large files (>50MB) may need chunking
- Languages: Optimized for English text
- LLM Dependency: Requires internet connection for AI processing
- Maximum tested: 1000 systems (15K entities, 12K relationships)
- Rate limits: Depends on LLM provider (~100 requests/minute)
- Hardware requirements: 8GB RAM recommended for 1000+ systems
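The file-size limit above mentions chunking, and the pipeline description mentions removing ANSI codes and normalizing whitespace before text reaches the LLM. The sketch below shows one plain-Python way to do both. It is not the project's `TextProcessor`; the function names and the 2000-character default are assumptions chosen for illustration.

```python
import re

# Matches common ANSI colour/cursor escape sequences such as "\x1b[31m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def clean(text: str) -> str:
    """Strip ANSI escape sequences and collapse runs of spaces or tabs on each line."""
    text = ANSI_ESCAPE.sub("", text)
    return "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())

def chunk_lines(text: str, max_chars: int = 2000) -> list[str]:
    """Split text at line boundaries into chunks of about max_chars or less.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines():
        if current and size + len(line) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 accounts for the newline joining lines
    if current:
        chunks.append("\n".join(current))
    return chunks

print(clean("\x1b[31mhttpd   started\x1b[0m"))
```

Breaking only at line boundaries keeps each log entry intact, which matters because the LLM infers relationships from complete entries.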
Traditional log parsing uses rigid patterns:
# Regex approach (limited)
"httpd\[(\d+)\]: (.+)" → Extract PID and message
# Misses context, relationships, semantic meaning
LLMs provide semantic intelligence:
# AI approach (intelligent)
"systemd[1]: Started The Apache HTTP Server"
# Understands: Apache = httpd service
# Infers: systemd MANAGES httpd
# Connects: Related to ports 80/443, SSL, web traffic
system/
├── /etc/redhat-release → Server entity (version, architecture)
├── /var/log/messages → Service events, system activities
├── /var/log/secure → User activities, authentication
├── /var/log/yum.log → Package installations/updates
├── /etc/httpd/conf/ → Service configurations
└── /var/lib/insights/ → Security findings, vulnerabilities
Server Entities:
# From: /etc/redhat-release
"Red Hat Enterprise Linux release 9.3"
# AI extracts ↓
Server {
  name: "web-prod-01",
  rhel_version: "9.3",
  environment: "production"
}
Service Entities:
# From: /var/log/messages
"systemd[1]: Started The Apache HTTP Server"
# AI extracts ↓
Service {
  name: "httpd",
  status: "active",
  managed_by: "systemd"
}
Intelligent Relationships:
# AI automatically creates:
(web-prod-01:Server)-[:RUNS]->(httpd:Service)
(httpd:Service)-[:USES]->(httpd-package:Package)
(httpd:Service)-[:DEPENDS_ON]->(openssl:Package)
With LLM-extracted graphs, you can ask:
"Which production web servers have SSL configuration issues?"
"Show me dependency chains for services with recent security patches"
"What would be impacted if I restart the MySQL service?"
The AI creates semantic relationships that enable sophisticated queries impossible with traditional parsing.
Licensed under the Apache License 2.0 - see LICENSE file.
Summary: Commercial use, modification, and distribution are allowed. A license notice is required.
Built With:
- LangChain - LLM application framework
- Neo4j - Graph database platform
- Red Hat AI / OpenAI - LLM API services
Contributors:
- @rrbanda - Creator and maintainer
- Community - See Contributors
Star this project if it helps you build intelligent infrastructure graphs!
Questions? Open an issue or discussion.